info (sample id, group, library size (lib.size), and normalization factors (norm.factors)),
with the normalization factors initially set to 1. These values will be updated later, and
more slots will be added as well.
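To see what this sample information looks like, the following minimal sketch builds a DGEList from a small, made-up count matrix and prints its samples table. The gene names, sample names, counts, and group labels are hypothetical placeholders, not the chapter's dataset; only the edgeR function DGEList() and the column names group, lib.size, and norm.factors come from the package itself.

```r
# Minimal sketch with made-up counts (not the chapter's data)
library(edgeR)

# Hypothetical 3-gene x 2-sample count matrix (filled column by column)
counts <- matrix(c(10, 0, 5, 20, 3, 8), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("s1", "s2")))

# Hypothetical group labels, one per sample
toy <- DGEList(counts = counts, group = c("control", "treated"))

# The samples table holds group, lib.size (column sums of the counts),
# and norm.factors (all 1 until normalization is run)
print(toy$samples)
```

The object is named toy here so that it does not overwrite the DGEList object (y) used in the text.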
5.3.7.2 Annotation
The row names of the count data frame are gene symbols, as shown in Figure 5.5. Some
downstream analyses require the count data to be annotated with NCBI Entrez IDs and
full gene names, which are not yet included in the count dataset. To add these
annotations to the DGEList object (y), we need to use the Entrez IDs, rather than the
gene symbols, as the row names. To obtain the Entrez IDs and gene names, we need to
install and load the “org.Hs.eg.db” Bioconductor package, which provides genome-wide
annotation for human, based on mapping using Entrez Gene identifiers [34]. You can
install and load this package by running the following script at the R prompt:
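# Install BiocManager from CRAN if it is not already available
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# Install the human annotation package from Bioconductor, then load it
BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)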
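Once the package is loaded, gene symbols can be looked up in it. The following is a minimal sketch of one way to do this, using the select() function from AnnotationDbi (attached together with org.Hs.eg.db). The three symbols are illustrative placeholders; in the workflow above, the keys would presumably be the row names of the count data. This is not necessarily the exact procedure used in the rest of this section.

```r
# Sketch: map gene symbols to Entrez IDs and full gene names
library(org.Hs.eg.db)  # also attaches AnnotationDbi

# Illustrative symbols; in practice these might be rownames(y)
symbols <- c("TP53", "BRCA1", "EGFR")

# select() returns a data frame with one row per symbol/annotation match
ann <- AnnotationDbi::select(org.Hs.eg.db,
                             keys = symbols,
                             keytype = "SYMBOL",
                             columns = c("ENTREZID", "GENENAME"))

# Keep the first match for each symbol, in the same order as `symbols`
ann <- ann[match(symbols, ann$SYMBOL), ]
print(ann)
```

A symbol can map to more than one Entrez ID, or to none, so the result should be checked for duplicates and NA values before the IDs are used as row names.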
FIGURE 5.6 The DGEList object of the count data.
FIGURE 5.5 The data frame after adding row and column names and removing rows with all zeros.